
GPT-OSS prompt caching fix#297

Open
dmitryryabkov wants to merge 2 commits into lmstudio-ai:main from dmitryryabkov:fix/prompt-caching-clean

Conversation

@dmitryryabkov

Fixes lmstudio-ai/lmstudio-bug-tracker#1697

Fixed prompt caching for GPT-OSS 20B MLX models.

Two fixes:

  1. cache_wrapper.py: Added a fallback for cache layers that don't expose an `offset` attribute
  2. batched_model_kit.py: Separated cross-prompt cache key from live cache key

The main issue was that batched models (like GPT-OSS) were tracking generated tokens in the cross-prompt cache key, preventing cache hits for new prompts with overlapping content.
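A minimal sketch of the `offset` fallback from fix 1. This is illustrative only, not the PR's actual diff: every name except the `offset` attribute (the `keys` attribute, the tensor layout) is an assumption about how an MLX KV-cache layer might be shaped.

```python
# Hypothetical sketch of fix 1: read a cache layer's token position,
# falling back gracefully when the layer lacks an `offset` attribute.
# `layer` stands in for an MLX KV-cache entry; attribute names other
# than `offset` are assumptions for illustration.

def cache_layer_offset(layer) -> int:
    """Return the number of tokens already held by a cache layer."""
    # Fast path: standard MLX KVCache layers expose `offset` directly.
    offset = getattr(layer, "offset", None)
    if offset is not None:
        return offset
    # Fallback: derive the position from the cached key tensor's
    # sequence dimension, if present (assumed attribute and layout).
    keys = getattr(layer, "keys", None)
    if keys is not None:
        return keys.shape[2]  # assumed (batch, heads, seq_len, head_dim)
    # Empty or uninitialized layer: nothing cached yet.
    return 0
```

Without such a fallback, cache types that omit `offset` (as some batched GPT-OSS layers apparently do) would raise an `AttributeError` and defeat prompt caching entirely.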

Testing:

  • Unit test added in tests/test_cache_wrapper.py
  • Verified locally with GPT-OSS 20B model in LM Studio (replaced the contents of ~/.lmstudio/extensions/backends/vendor/_amphibian/app-mlx-generate-mac14-arm64@20/lib/python3.11/site-packages/mlx_engine/ with the updated file)
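The key separation in fix 2 can be sketched as follows. Again a hypothetical illustration of the described behavior, not the PR's code: the function and variable names are invented for the example.

```python
# Hypothetical sketch of fix 2: the cross-prompt cache is keyed by the
# prompt tokens only, while the live (in-flight) cache key also tracks
# generated tokens. Names are illustrative.

def cross_prompt_cache_key(prompt_tokens: list[int]) -> tuple[int, ...]:
    """Key for reuse across requests: depends only on the prompt."""
    return tuple(prompt_tokens)

def live_cache_key(prompt_tokens: list[int],
                   generated_tokens: list[int]) -> tuple[int, ...]:
    """Key for the current generation: prompt plus generated tokens."""
    return tuple(prompt_tokens) + tuple(generated_tokens)

prompt = [1, 2, 3]
generated = [4, 5]

# Before the fix, generated tokens leaked into the cross-prompt key,
# so a second request with the same prompt could never hit the cache.
assert live_cache_key(prompt, generated) != cross_prompt_cache_key(prompt)

# After the fix, an identical new prompt produces an identical key.
assert cross_prompt_cache_key([1, 2, 3]) == cross_prompt_cache_key(prompt)
```

This mirrors the bug described above: mixing generated tokens into the cross-prompt key made every follow-up request look like a cache miss even when its prompt overlapped an earlier one.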

@github-actions

github-actions bot commented Mar 29, 2026

All contributors have signed the CLA ✍️ ✅
Posted by the CLA Assistant Lite bot.

@dmitryryabkov
Author

I have read the CLA Document and I hereby sign the CLA

@github-actions github-actions bot added the CLA signed Indicates that all contributors have signed label Mar 29, 2026


Development

Successfully merging this pull request may close these issues.

Prompt caching doesn't work for MLX version of GPT-OSS 20B
